Synthetic Ontologies: A Hypothesis 


Towards More Flexible and Adaptable Knowledge Management Systems 


Introduction 


Data is becoming an increasingly valuable commodity 
that drives innovation, research, and development 
across all industries. The quality and structure of 
knowledge graphs depend on the ability to extract 
insights from data. The accuracy and efficiency of 
natural language processing algorithms play an 
important role in this context. While pre-defined 
ontologies have been a useful tool for organizing data, 
they are limited in their ability to capture the 
complexity and nuance of relationships between 
concepts. This essay explores a possibility to create 
knowledge graphs by developing alternative 
approaches for ontology engineering. It proposes the 
use of purely machine generated ontologies as a 
potential solution. This approach allows for the 
creation of ontologies that are not bound by the 
constraints of pre-defined structures, and can instead 
be tailored to the specific needs of a given use case. 


Problem Definition 


Data Exchange Standards as Key Value Drivers 
The emergence of the Internet age can be attributed to 
multiple factors. As Andrew Russell notes in his book 
"Open Standards and the Digital Age" [1], the 
development of communication standards stands out 
as a crucial catalyst. Much of the advancement can be 
traced back to the development of graph theory, which 
has facilitated the creation of highly efficient search 
algorithms, social networks, and recommendation 
systems that harness the power of graph analysis. 


In the early 1990s, John Barlow's seminal work 
"Cybernomics" [2] laid the groundwork for the 
information economy by outlining its key implications. 
The one that especially stands out is that “relationships 
replace things” — that a relationship is represented by 
an “active flow of information, the greater the flow the 
more valuable is the relationship.” The other key 
observation is that “transaction becomes continuous.” 
Three decades after the publication of “Cybernomics,” 
we see the rise of social networks. The other 
prediction, about continuous transactions, manifests 
itself in a multitude of subscription models, 
especially software as a service, which seems to be 
the most popular business model in the IT industry. 
Thus, the flow of data is crucial for the economy and 
the creation of value. 


In the field of economics, the creation of value is 
attributed to labour, capital, and total factor 
productivity. The latter measures the overall efficiency 
of an activity. As we live in the information economy 
where the value of services prevails over the value of 
products, the total factor productivity plays a key role. 
At the same time, analysts, data scientists, consultants, 
etc. are well aware that roughly 80% of their time is 
spent transforming data into a format that is amenable 
to analysis. Thus, there is a growing need to tame the 
multitude of standards used for data exchange and so 
reduce these inefficiencies. It is worth noting 
that, according to some estimates, around one-third of 
the data generated worldwide each day is medical 
data, making it a strong candidate for trying out new 
approaches to data processing. By streamlining data 
exchange standards, we can unlock value in previously 
untapped areas of knowledge. 


Future Shock and Language Singularity 

Before the term “singularity” became a household 
word, a common term was “future shock,” coined by 
Alvin Toffler in the book of the same title [3]. Back in 
1970, he observed an ever-increasing acceleration of 
change and information overload. Today the future 
shock has finally caught up with our language. The 
number of item sub-categories that get manufactured 
or provided as a service per unit of language increases; 
therefore, as the usage of any word becomes more 
frequent, its meaning collapses to the most frequent 
association not necessarily reflecting its “yesterday’s 
original meaning.” As a result, unification of standards 
is a challenge in many industries. The recent 
developments with large language models (LLMs) make 
this problem especially difficult in knowledge 
management. It can be argued that LLMs will produce 
a lot of noise and exacerbate the biases present in the 
data on which they were trained, distorting the 
language further. 


The environment of current data formats can be 
described by “VUCA.” This acronym stands for 
Volatility, Uncertainty, Complexity, and Ambiguity. It is 
a concept that originated in 1987, based on the 
leadership theories of Warren Bennis and Burt Nanus. 
It was quickly adopted in the field of military strategy 
and has since been applied to various contexts. A VUCA 
world is one that is characterized by rapid and 
unpredictable change, where traditional approaches to 
problem-solving and decision-making may no longer be 
effective. Organizations and individuals must be 
adaptable, resilient, and able to navigate complexity 
and ambiguity in order to succeed. The concept is also 
used to refer to the state of the global economy and 
challenges organizations face in the 21st century [4]. 


Thus, the problem can be distilled to finding a way 
to develop ontologies extremely quickly, when new 
scientific discoveries make yesterday’s knowledge 
obsolete in a matter of months, if not weeks. 


Problem Analysis 


Two Key Problems with the Language Itself 

First of all, the nature of the language is different from 
the nature of reality. According to the work of Vasilii 
Nalimov titled “In the Labyrinths of Language: A 
Mathematical Journey” [5], the key characteristic of 
any language is its discrete nature. At the same time, 
the author points out the continuous nature of reality 
which creates some unresolvable issues with the use of 
a language to describe the world around us. 


Secondly, the language, as a formal logical structure, is 
either internally inconsistent or incomplete if viewed as 
a closed set of rules. Anyone familiar with the 
incompleteness theorems of the 20th-century logician 
Kurt Gödel understands that paradoxes or contradictions 
are inevitable in any closed logical system. 


Ontologies as a Meta-Language 

Ontologies came out of the branch of philosophy 
known as metaphysics, which deals with the nature of 
reality and existence. Since the 1970s, researchers 
have recognized the importance of ontologies for 
knowledge engineering and building powerful AI 
algorithms. Ontology in the field of 
computer science and information science is “a 
representation, formal naming, and definition of the 
categories, properties, and relationships between the 
concepts, data, and entities that substantiate one, 
many, or all domains of discourse” [6]. 


To spare the reader from the mathematical jargon, let’s 
dive right into the relationships between ontologies 
and knowledge graphs by resorting to a diagram. The 
figure below is a concept map that shows ontology 
from the point of view of computer science. To analyse 
the problem, the author adopted the approach 
documented by Huang (2022) [7]. 

Common Methods of Ontology Engineering 

It is worth mentioning that the contemporary literature 
provides a broad range of approaches to ontology 
engineering. One common way to classify them is into 
“top-down” and “bottom-up” 
approaches. While the top-down methodology 
emphasizes the reuse of existing domain ontologies 
and upper-level ontologies, the bottom-up 
methodology is used to specifically reveal the “possible 
semantics of the data” on which the ontology is 
constructed [8]. However, the discussion of this matter 
at length lies beyond the scope of this particular work, 
whose goal is to build a case to support the hypothesis 
proposed in the next section. 


Proposed Solution 


Bottom-Up Approach to Ontology Construction 
According to the analysis of Keet [9], the modern 
bottom-up approaches to generating ontologies range 
from fully manual to almost fully automated. The 
almost fully automated approaches rely on natural 
language processing algorithms; however, that method, 
according to [9], is used only as a last resort due to 
the ambiguous nature of language. The documented 
approaches rely on highly granular automation; for 
example, a part-of-speech tagger is used as a separate 
step. By contrast, this work proposes a fully 
automated approach. The author of this essay 
stumbled upon this idea after multiple failed attempts 
to automate named entity recognition and relation 
extraction from academic medical research articles and 
finally finding a solution by fully switching to an LLM. 


A Synthetic Ontology 

The proposed ontology engineering approach is fully 
automated. It is based on a strict set of prompts that 
must respect the limited memory “window” (the context 
window) of an LLM, which has to fit both the prompt and 
the generated text. 
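
The window constraint described above amounts to a simple budget: the tokens in the prompt plus the tokens reserved for the answer must not exceed the model's context window. A minimal Python sketch of that check follows. The function names, the whitespace-based token estimate, and the per-answer reserve are illustrative assumptions, not part of the author's setup; a real tokenizer would report different counts.

```python
from typing import List


def fits_context(prompt_tokens: int, max_new_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fit the window."""
    return prompt_tokens + max_new_tokens <= context_window


def rough_token_count(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    # Real tokenizers usually produce more tokens than this.
    return len(text.split())


def batch_terms(terms: List[str], prompt_overhead: int,
                tokens_per_answer: int, context_window: int) -> List[List[str]]:
    """Greedily group terms so each batch's prompt and expected answer fit."""
    batches: List[List[str]] = []
    current: List[str] = []
    for term in terms:
        candidate = current + [term]
        prompt_tokens = prompt_overhead + sum(rough_token_count(t) for t in candidate)
        reserved = tokens_per_answer * len(candidate)
        if current and not fits_context(prompt_tokens, reserved, context_window):
            batches.append(current)   # close the full batch
            current = [term]          # a lone term is kept even if it is too large
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

With a hypothetical window of 130 tokens, a prompt overhead of 20, and 50 tokens reserved per answer, four one-word terms would be split into two batches of two terms each.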


At this stage, one prompt can generate the list of terms 
for the first level of the hierarchy. A second prompt 
generates the second level of the hierarchy for each 
individual first-level item, and so on. Then a 
separate prompt has to establish relations between 
the categories that the previously generated terms fall into. 
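
The level-by-level prompting described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the prompt wording, the `expand` and `relation_prompt` helpers, and the `canned_llm` stand-in (which replaces a real model call with fixed answers) are all hypothetical and not the author's actual prompts or tooling.

```python
import json
from typing import Callable, Dict, List


def level_prompt(domain: str, parent: str, n: int) -> str:
    """Prompt for the direct subcategories of one term (hypothetical wording)."""
    return (
        f"List {n} subcategories of '{parent}' within the domain of {domain}. "
        "Answer with a JSON array of strings only."
    )


def relation_prompt(terms: List[str]) -> str:
    """Separate prompt asking for relations between already generated terms."""
    return (
        "For the following terms, list relations as JSON objects with "
        f"'type', 'source' and 'target': {json.dumps(terms)}"
    )


def expand(domain: str, parent: str, depth: int, n: int,
           ask: Callable[[str], str]) -> Dict:
    """Build the hierarchy one level per prompt, one prompt per parent term."""
    node = {"name": parent, "children": []}
    if depth == 0:
        return node
    for name in json.loads(ask(level_prompt(domain, parent, n))):
        node["children"].append(expand(domain, name, depth - 1, n, ask))
    return node


def canned_llm(prompt: str) -> str:
    """Stand-in for a model call: returns fixed answers for known parents."""
    answers = {
        "Biology": ["Ecology", "Genetics"],
        "Ecology": ["Population Ecology"],
        "Genetics": ["Population Genetics"],
    }
    for parent, children in answers.items():
        if f"'{parent}'" in prompt:
            return json.dumps(children)
    return "[]"
```

Calling `expand("Biology", "Biology", depth=2, n=2, ask=canned_llm)` would issue one prompt for the first level and one prompt per first-level term for the second. Because each prompt covers a single parent, no individual request has to hold the whole hierarchy.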


The only challenge at the moment, which will likely be 
solved soon as LLMs get an ever-larger operating 
“memory,” is developing an approach to stitch together 
all the patches generated by the LLM into one 
non-contradictory and fully exhaustive whole. 
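
As a first, deliberately naive step toward such stitching, separately generated patches could be merged while flagging terms that end up under two different parents. The sketch below assumes each patch is a mapping from a parent term to its list of children; that representation and the `stitch` function are assumptions made for illustration. It detects only one kind of contradiction and says nothing about exhaustiveness.

```python
from typing import Dict, List, Tuple


def stitch(fragments: List[Dict[str, List[str]]]
           ) -> Tuple[Dict[str, str], List[Tuple[str, str, str]]]:
    """Merge parent->children patches into one child->parent map.

    Returns the merged map and a list of conflicts, each given as
    (child, parent kept, parent rejected).
    """
    parent_of: Dict[str, str] = {}
    conflicts: List[Tuple[str, str, str]] = []
    for fragment in fragments:
        for parent, children in fragment.items():
            for child in children:
                # The first parent seen for a child wins; later ones are reported.
                known = parent_of.setdefault(child, parent)
                if known != parent:
                    conflicts.append((child, known, parent))
    return parent_of, conflicts
```

The conflicts list is what a human reviewer, or a follow-up prompt, would then have to resolve.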


However, despite this temporary difficulty, the benefits 
of this bottom-up approach are significant. Synthetic 
ontology engineering allows for the creation of more 
comprehensive and accurate knowledge graphs that 
can be personalized to the specific needs of individual 
users. It can capture more complex relationships 
between concepts, and can be updated and refined 
over time as new data becomes available, making it 
more accurate and always up-to-date. Additionally, the 
approach is scalable and adaptable to new contexts 
and data, making it well-suited for applications in 
knowledge domains where the change is especially 
fast. 


Thus, the author proposes that the solution rests on at 
least the following key processes: 


* Storage of context, such as text or graphics, along 
  with a term from an ontology. 

* Automatic generation of ontologies based on the 
  context and the right prompt for an LLM. 

* Human verification test, which can later be 
  automated. 

* Test of internal consistency. 

* Test of domain coverage. 

* A version control system. 
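
Two of the processes listed above, the test of internal consistency and the test of domain coverage, lend themselves to simple automated checks. The Python sketch below assumes an ontology stored as a list of concepts (each with an `id` and a `name`) plus a list of typed relations with `source` and `target` ids. Both that shape and the particular checks are assumptions chosen for illustration; a full consistency test would require a proper reasoner.

```python
from typing import Dict, Iterable, List


def has_is_a_cycle(relations: Iterable[Dict]) -> bool:
    """True if following is-a links from some concept leads back to it."""
    parents: Dict[int, set] = {}
    for rel in relations:
        if rel["type"] == "is-a":
            parents.setdefault(rel["source"], set()).add(rel["target"])

    def reaches(start, node, seen) -> bool:
        for parent in parents.get(node, ()):
            if parent == start:
                return True
            if parent not in seen:
                seen.add(parent)
                if reaches(start, parent, seen):
                    return True
        return False

    return any(reaches(node, node, set()) for node in parents)


def consistency_problems(ontology: Dict) -> List[str]:
    """Collect structural contradictions in a concepts/relations ontology."""
    problems = []
    ids = [concept["id"] for concept in ontology["concepts"]]
    if len(ids) != len(set(ids)):
        problems.append("duplicate concept ids")
    known = set(ids)
    for rel in ontology["relations"]:
        for end in ("source", "target"):
            if rel[end] not in known:
                problems.append(f"{rel['type']} relation has unknown {end} {rel[end]}")
    if has_is_a_cycle(ontology["relations"]):
        problems.append("is-a relations form a cycle")
    return problems


def domain_coverage(ontology: Dict, reference_terms: Iterable[str]) -> float:
    """Share of reference terms that appear as concept names (case-insensitive)."""
    wanted = {term.lower() for term in reference_terms}
    if not wanted:
        return 1.0
    names = {concept["name"].lower() for concept in ontology["concepts"]}
    return len(wanted & names) / len(wanted)
```

The reference terms for the coverage check would have to come from an independent source, for example a glossary of the domain; exact name matching is a crude proxy and will miss synonyms.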


Preliminary Results 

Model GPT-NEOX-20B: The model shows basic, 
rudimentary internal consistency given prompts of 
sufficient length. It can generate general statements 
about ontologies and very rudimentary, very general 
high-level conceptual structures. However, it fails to 
generate statements in JSON format, as prompted. 


Model GPT-3.5: The model provides sufficiently well- 
formed answers in JSON format as prompted, correctly 
specifying ontologies. Please see one example below. 
One can easily imagine how it might be possible to 
generate an ontology having processed a corpus of 
texts describing some knowledge domain. An 
interesting artefact is that, despite being an 
autoregressive LLM, this model can make 
“forward-looking” links for text it has not yet generated. 


An example of the Taxonomical Part of an Ontology for the Field of Biology 


Prompt: 
Build a taxonomy for the domain of biology that 
consists of 10 broad categories in JSON format. 


Response: 

Sure! Here's an example of a taxonomy for the 
domain of biology, consisting of 10 broad 
categories in JSON format: 


json 


{
  "taxonomy": "Biology",
  "categories": [
    {
      "id": "01",
      "name": "Ecology",
      "children": []
    },
    {
      "id": "02",
      "name": "Cellular Biology",
      "children": []
    },
    {
      "id": "03",
      "name": "Molecular Biology",
      "children": []
    },
    {
      "id": "04",
      "name": "Genetics",
      "children": []
    },
    {
      "id": "05",
      "name": "Microbiology",
      "children": []
    },
    {
      "id": "06",
      "name": "Physiology",
      "children": []
    },
    {
      "id": "07",
      "name": "Anatomy",
      "children": []
    },
    {
      "id": "08",
      "name": "Botany",
      "children": []
    },
    {
      "id": "09",
      "name": "Zoology",
      "children": []
    },
    {
      "id": "10",
      "name": "Evolution",
      "children": []
    }
  ]
}


In this example, the taxonomy is named 
"Biology" and consists of 10 broad categories: 


Ecology 
Cellular Biology 
Molecular Biology 
Genetics 
Microbiology 
Physiology 
Anatomy 
Botany 
Zoology 
Evolution 


Each category has an ID, a name, and an empty 
list of children. The taxonomy can be expanded 
by adding more subcategories under each of the 
10 broad categories, allowing for more 
detailed classification of biological 
concepts. 


In this example, an LLM generated the first-level 
taxonomy. Given the right prompt, it can add relations. 
Obviously, the model is limited by the size of its 
“working memory”; however, this limitation can be in 
part circumvented by prompting the model to provide 
taxonomy and relations separately. Likewise, each 
section can be expanded by providing specially 
prepared prompts. 
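
A response like the biology example above wraps the JSON payload in conversational text, so an automated pipeline would first have to pull the payload out and check that it has the requested shape. A minimal Python sketch, assuming the payload is the first decodable JSON object in the response; the helper names and the specific checks are illustrative only:

```python
import json
from typing import Dict, List, Optional


def extract_json(response: str) -> Optional[Dict]:
    """Return the first JSON object embedded in free text, or None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(response, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = response.find("{", start + 1)
    return None


def taxonomy_problems(taxonomy: Dict, expected_categories: int) -> List[str]:
    """List deviations from the id/name/children shape requested in the prompt."""
    categories = taxonomy.get("categories")
    if not isinstance(categories, list):
        return ["'categories' is missing or not a list"]
    problems = []
    if len(categories) != expected_categories:
        problems.append(
            f"expected {expected_categories} categories, got {len(categories)}"
        )
    for position, category in enumerate(categories):
        if not isinstance(category, dict):
            problems.append(f"category {position} is not an object")
            continue
        missing = {"id", "name", "children"} - set(category)
        if missing:
            problems.append(f"category {position} lacks {sorted(missing)}")
    return problems
```

A response that yields `None`, or a non-empty list of problems, could simply be regenerated with the same prompt.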


In the following example, even though it is not 
technically an ontology, you can see how concepts and 
their relations are listed in the same JSON output. 


Given a sufficient number of carefully engineered 
prompts, it might be possible to extract the structure of 
any knowledge domain and convert it into an ontology. 


A Sample of Raw / Unedited LLM-Generated Output in JSON Format 


{
  "name": "Future Shock Ontology",
  "concepts": [
    {
      "name": "Accelerating Change",
      "id": 1,
      "description": "The exponential increase in the pace of change across multiple dimensions, including technological progress, social norms, and economic systems.",
      "related_concepts": [2, 3, 5]
    },
    {
      "name": "Discontinuity",
      "id": 2,
      "description": "The frequent and abrupt shifts in social and economic systems, resulting in disorientation and confusion for individuals and societies.",
      "related_concepts": [1, 3, 5]
    },
    {
      "name": "Overload",
      "id": 3,
      "description": "The overwhelming amount of information and stimulation, leading to psychological and social problems.",
      "related_concepts": [1, 2, 5]
    },
    {
      "name": "Individualism",
      "id": 4,
      "description": "The emphasis on individualism and personalization, as people seek to cope with the pressures of change and find meaning in their lives.",
      "related_concepts": [1, 3, 6]
    },
    {
      "name": "Technological Determinism",
      "id": 5,
      "description": "Technology is seen as the primary driver of social change, with the power to shape human behavior and institutions in profound ways.",
      "related_concepts": [1, 2, 3]
    },
    {
      "name": "Social Fragmentation",
      "id": 6,
      "description": "The breakdown of traditional social institutions and the emergence of new, fluid social forms.",
      "related_concepts": [4, 5, 7]
    },
    {
      "name": "Psychological Dislocation",
      "id": 7,
      "description": "The disorienting effects of rapid change lead to psychological dislocation and social malaise, with individuals struggling to find a sense of identity and purpose.",
      "related_concepts": [2, 3, 6]
    }
  ],
  "relations": [
    { "type": "is-a", "source": 1, "target": 5 },
    { "type": "is-a", "source": 2, "target": 5 },
    { "type": "is-a", "source": 3, "target": 5 },
    { "type": "is-a", "source": 4, "target": 6 },
    { "type": "is-a", "source": 6, "target": 5 },
    { "type": "causes", "source": 1, "target": 2 },
    { "type": "causes", "source": 1, "target": 3 },
    { "type": "causes", "source": 2, "target": 7 },
    { "type": "causes", "source": 3, "target": 7 },
    { "type": "leads-to", "source": 2, "target": 3 }
  ]
}
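
An output of this shape, with concepts identified by number and relations that refer to those numbers, maps naturally onto the subject-predicate-object triples of a knowledge graph. The Python sketch below shows one possible conversion. The function names are hypothetical, and silently skipping relations whose ids are unknown is a design assumption; a stricter pipeline might reject the whole output instead.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def to_triples(ontology: Dict) -> List[Triple]:
    """Turn id-based relations into (subject, predicate, object) name triples."""
    name_of = {concept["id"]: concept["name"] for concept in ontology["concepts"]}
    triples = []
    for rel in ontology["relations"]:
        source = name_of.get(rel["source"])
        target = name_of.get(rel["target"])
        if source is None or target is None:
            continue  # relation points at a concept that was never defined
        triples.append((source, rel["type"], target))
    return triples


def outgoing(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples by subject so a concept's links can be looked up directly."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return dict(index)
```

Applied to the sample, the relation with type "causes", source 1 and target 2 would become the triple ("Accelerating Change", "causes", "Discontinuity").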


Future Work 


There’s a Cambrian explosion of prompts people share 
online. This primordial soup of prompts appears to be 
a new language, a meta-language if you will. The 
prompts may form a protocol layer, one that is currently 
taking shape, akin to smart contracts that run on a 
blockchain. One of the avenues to develop synthetic 
ontologies might be to crowdsource the construction 
of a system of the most effective prompts. 


Furthermore, the solution has so far been worked out 
only at a conceptual level. The approach 
needs a thorough mathematical treatment. Both the 
logic behind the approach and its mathematical 
representation have to be properly addressed in 
future work. 


Currently available LLMs can handle a number of 
tokens that is already sufficient to create a small 
ontology piece by piece. This number is expected to 
grow significantly. Soon, we will no longer have a blunt 
tool with a small short-term memory that can process 
only a few key phrases, but one able to process full 
ontologies generated from a set of 
data and prompts provided to an LLM. 


Conclusion 

As data continues to be a driving force in the digital 
economy, it is critical that we develop effective tools 
for organizing and interpreting it. Synthetic ontology 
engineering may become a significant step forward in 
this regard. This approach allows for the creation of 
ontologies that are not bound by the constraints of pre- 
defined structures, and can instead be tailored to the 
specific needs of a given use case. The best way to test 
the hypothesis proposed in this essay is to 
choose the field of life sciences, as it is a complex and 
challenging field. While this is still a hypothesis that 
needs to be tested and validated, it is a promising 
solution for the future of knowledge management. 
The more than three quarters of working time now used 
for data preparation could instead be spent on more 
valuable activity, thus unlocking immense value for the 
economy. 


Civilization came a long way from the first 
stone axes before it built the first microscope. In the 
same vein, the currently available LLMs represent a 
relatively blunt technology. We still have to 
learn to use this tool before we are able to develop a 
meta-language for programming LLMs. Thus, 
developing protocols to build synthetic ontologies 
might be the first step toward that end. 


Afterword 


Researchers and organizations in the field of 
knowledge management are welcome to our channel 
“Connected Science” (https://discord.gg/fMa35fS2) 
to join a strong network of like-minded individuals who 
can help you achieve your goals and bring your projects 
to fruition. 